Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants, and and predict the manner in which participants did the exercise. Participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
training <- read.csv("pml-training.csv", na.strings = c("", "NA"))
testing <- read.csv("pml-testing.csv", na.strings = c("", "NA"))
# We hided results in the final output because it's too long
head(training,10)
dim(training)
## [1] 19622 160
Our obeservations: 1) the dataset is quite large. It also has a lot of variables. 2) Each sensor captures multiple datapoints. Some variables has a lot of NA values. They might not be useful and we should consider exlcude these variables. 3) The name of variable ends with the name of the particular sensor that is providing the datapoint. This allows us to group variables together and possibily conducts a PCA to reduce dementions.
# We hided results in the final output because it's too long
library(dplyr)
col_na_count <- training %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))
col_na_count
Here we have count of NA in each column. As we can see, for columns with NA, NA is the dominated value - there are 19216 NAs out of 19622 rows. Let’s remove columns with more than half rows NA.
# We hided results in the final output because it's too long
training_filter <- training %>%
select(-which(col_na_count > 0.5 * nrow(training)))
head(training_filter)
Now we are down to 60 variables.
Among the predictors, we can see they can be grouped in to groups based on their names. Let’s group them and check the correlation among variables. First let’s take a look at Belt.
# We hided results in the final output because it's too long
training_belt <- training_filter %>% select(contains("belt"))
cor(training_belt)
head(training_belt)
As we can see, a lot of variables are highly correlated, such as -0.99. We can use Principle Componenet Analysis to reduce the dementions. Notice the scales of these variables are very different and there are both positive and negative values, we will need to center and scale our data as well.
library(caret)
training_belt_pcamodel <- preProcess(training_belt, method = c("center", "scale", "pca"), thresh = 0.90)
training_belt_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 4 components to capture 90 percent of the variance
As we can see, we only need 4 components do perserve 90% of the variance. This will decrease reduce our data dimentions dramatically.
# apply PCA model
training_belt_pca <- predict(training_belt_pcamodel, newdata = training_belt)
# rename the columne names
names(training_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")
Next we will repeat the process for the rest three group of variable - forearm, arm, and dumbell.
# Create model for forearm
training_forearm <- training_filter %>% select(contains("forearm"))
training_forearm_pcamodel <- preProcess(training_forearm, method = c("center", "scale", "pca"), thresh = 0.90)
training_forearm_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 8 components to capture 90 percent of the variance
# We need 8 variables to perserve 90% variance
training_forearm_pca <- predict(training_forearm_pcamodel, newdata = training_forearm)
names(training_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5","forearm_pca6","forearm_pca7","forearm_pca8")
# Create model for arm
training_arm <- training_filter %>% select(contains("_arm"))
training_arm_pcamodel <- preProcess(training_arm, method = c("center", "scale", "pca"), thresh = 0.90)
training_arm_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 7 components to capture 90 percent of the variance
# We need 7 variables to perserve 90% variance
training_arm_pca <- predict(training_arm_pcamodel, newdata = training_arm)
names(training_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5","arm_pca6","arm_pca7")
# Create model for dumbbell
training_dumbbell <- training_filter %>% select(contains("_dumbbell"))
training_dumbbell_pcamodel <- preProcess(training_dumbbell, method = c("center", "scale", "pca"), thresh = 0.90)
training_dumbbell_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 6 components to capture 90 percent of the variance
# We need 6 variables to perserve 90% variance
training_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = training_dumbbell)
names(training_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5","dumbbell_pca6")
Now let’s combine all the variables after PCA.
training_pca <- data.frame(user_name = training[, 2], classe = training$classe, training_belt_pca, training_arm_pca, training_forearm_pca, training_dumbbell_pca )
Nowe we have 33 variables. Let’s do some Exploratory Data Analysis.
table(training_pca$user_name, training_pca$classe)
##
## A B C D E
## adelmo 1165 776 750 515 686
## carlitos 834 690 493 486 609
## charles 899 745 539 642 711
## eurico 865 592 489 582 542
## jeremy 1177 489 652 522 562
## pedro 640 505 499 469 497
Each of the 6 participants performed 5 classes of workout.
# plot pc1 of belt, forearm, and dumbbell, color by users
library(plotly)
plot_ly(training_pca, x = ~belt_pca1, y = ~forearm_pca1, z = ~dumbbell_pca1, color = ~user_name)
In this plot, we can see that user is differently a very important factor. No matter which classe is, a user’s movements seem to be in similar range, therefore grouped together.
plot_ly(subset(training_pca, user_name = "adelmo"), x = ~belt_pca1, y = ~forearm_pca1, z = ~dumbbell_pca1, color = ~classe)
In this plot, we only look at Adelmo’s movements. We can see that: 1) each classes has different range of movements. 2) There might be some outliers in the range of movement. For example, one data point has extremely large dumbbell_pca1. However, Because we only plotted 3 vairables, it’s hard to see if these are real outliers, so we decided to keep them.
First, let’s slice the training data into training and validation sample.
# Define train control for k fold cross validation
# train_control <- trainControl(method="cv", number= 5, savePredictions = TRUE)
set.seed(123)
index <- createDataPartition(training$classe, p = 0.6, list = FALSE)
pca_tra <- training_pca[index,]
pca_val <- training_pca[-index,]
This is a mutiple calssification problem. The models we can use are tree based models and k-nearest neighbours.
#Random Forest
model1 <- train(classe ~., data= pca_tra, method="rf")
pca_val$pred_rf <- predict(model1, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2216 8 3 2 3
## B 56 1449 12 1 0
## C 3 32 1322 11 0
## D 6 0 62 1210 8
## E 0 0 11 7 1424
##
## Overall Statistics
##
## Accuracy : 0.9713
## 95% CI : (0.9674, 0.9749)
## No Information Rate : 0.2907
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9637
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9715 0.9731 0.9376 0.9829 0.9923
## Specificity 0.9971 0.9891 0.9929 0.9885 0.9972
## Pos Pred Value 0.9928 0.9545 0.9664 0.9409 0.9875
## Neg Pred Value 0.9884 0.9937 0.9864 0.9968 0.9983
## Prevalence 0.2907 0.1898 0.1797 0.1569 0.1829
## Detection Rate 0.2824 0.1847 0.1685 0.1542 0.1815
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9843 0.9811 0.9652 0.9857 0.9948
#KNN Model
ctrl <- trainControl(method="repeatedcv",repeats = 3)
model2 <- train(classe ~., data= pca_tra, method="knn", trControl = ctrl)
pca_val$pred_knn <- predict(model2, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2182 24 15 9 2
## B 81 1379 45 11 2
## C 17 30 1283 29 9
## D 5 4 68 1205 4
## E 2 16 15 20 1389
##
## Overall Statistics
##
## Accuracy : 0.948
## 95% CI : (0.9429, 0.9528)
## No Information Rate : 0.2915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9342
##
## Mcnemar's Test P-Value : 1.762e-12
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9541 0.9491 0.8997 0.9458 0.9879
## Specificity 0.9910 0.9783 0.9868 0.9877 0.9918
## Pos Pred Value 0.9776 0.9084 0.9379 0.9370 0.9632
## Neg Pred Value 0.9813 0.9883 0.9779 0.9895 0.9973
## Prevalence 0.2915 0.1852 0.1817 0.1624 0.1792
## Detection Rate 0.2781 0.1758 0.1635 0.1536 0.1770
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9725 0.9637 0.9432 0.9668 0.9898
Random Forest model has higher accuracy than the KNN model. It has 97% accuracy. We will use this model to predict the testing data.
col_na_count_testing <- testing %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))
testing_filter <- testing %>%
select(-which(col_na_count_testing > 0.5 * nrow(testing)))
# Create pca variables with "belt"
testing_belt <- testing_filter %>% select(contains("belt"))
testing_belt_pca <- predict(training_belt_pcamodel, newdata = testing_belt)
names(testing_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")
# Create pca variables with "forearm"
testing_forearm <- testing_filter %>% select(contains("forearm"))
testing_forearm_pca <- predict(training_forearm_pcamodel, newdata = testing_forearm)
names(testing_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5", "forearm_pca6", "forearm_pca7", "forearm_pca8")
# Create pca variables with "_arm"
testing_arm <- testing_filter %>% select(contains("_arm"))
testing_arm_pca <- predict(training_arm_pcamodel, newdata = testing_arm)
names(testing_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5", "arm_pca6", "arm_pca7")
# Create pca variables with "dumbbell"
testing_dumbbell <- testing_filter %>% select(contains("dumbbell"))
testing_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = testing_dumbbell)
names(testing_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5", "dumbbell_pca6")
# Combine groups of categories
testing_pca <- data.frame(user_name = testing[, 2], testing_belt_pca, testing_arm_pca, testing_forearm_pca, testing_dumbbell_pca )
testing_pca$pred <- predict(model1, newdata = testing_pca, type = "raw")
testing_pca$pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E